from google.colab import drive
drive.mount('/content/drive',force_remount=True)
Mounted at /content/drive
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Q1.A Read ‘Car_name.csv’ as a DataFrame and assign it to a variable.
df1 = pd.read_csv('/content/drive/MyDrive/Car_name.csv')
df1.head()
| car_name | |
|---|---|
| 0 | chevrolet chevelle malibu |
| 1 | buick skylark 320 |
| 2 | plymouth satellite |
| 3 | amc rebel sst |
| 4 | ford torino |
Q1.B Read ‘Car-Attributes.json’ as a DataFrame and assign it to a variable.
df2 = pd.read_json('/content/drive/MyDrive/Car-Attributes.json')
df2.head()
| mpg | cyl | disp | hp | wt | acc | yr | origin | |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 |
Q1.C Merge both DataFrames into a single DataFrame.
print("The Shape of car name df1 dataframe", df1.shape)
print("The Shape of car attribute name df2 dataframe", df2.shape)
The Shape of car name df1 dataframe (398, 1)
The Shape of car attribute name df2 dataframe (398, 8)
# Both datasets have the same number of rows (and the same row order), so they can be merged directly with a column-wise concat.
df = [df1,df2]
df = pd.concat(df,axis=1)
df.head()
| car_name | mpg | cyl | disp | hp | wt | acc | yr | origin | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 |
| 1 | buick skylark 320 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 |
| 2 | plymouth satellite | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 |
| 3 | amc rebel sst | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 |
| 4 | ford torino | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 |
print("The shape of new combined dataset", df.shape)
The shape of new combined dataset (398, 9)
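Since both frames share the same length and row order, a column-wise `pd.concat` is equivalent to joining on the shared index. A minimal sketch on toy data (not the actual files):

```python
import pandas as pd

# Toy stand-ins for the two files: same length, same row order.
names = pd.DataFrame({"car_name": ["a", "b", "c"]})
attrs = pd.DataFrame({"mpg": [18.0, 15.0, 18.0], "cyl": [8, 8, 8]})

merged_concat = pd.concat([names, attrs], axis=1)
merged_join = names.join(attrs)  # joins on the shared RangeIndex

assert merged_concat.equals(merged_join)
```

If the row orders could differ, a key-based `pd.merge` would be the safer choice; here the aligned indices make concat sufficient.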
Q1.D Print 5 point summary of the numerical features and share insights.
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| mpg | 398.0 | 23.514573 | 7.815984 | 9.0 | 17.500 | 23.0 | 29.000 | 46.6 |
| cyl | 398.0 | 5.454774 | 1.701004 | 3.0 | 4.000 | 4.0 | 8.000 | 8.0 |
| disp | 398.0 | 193.425879 | 104.269838 | 68.0 | 104.250 | 148.5 | 262.000 | 455.0 |
| wt | 398.0 | 2970.424623 | 846.841774 | 1613.0 | 2223.750 | 2803.5 | 3608.000 | 5140.0 |
| acc | 398.0 | 15.568090 | 2.757689 | 8.0 | 13.825 | 15.5 | 17.175 | 24.8 |
| yr | 398.0 | 76.010050 | 3.697627 | 70.0 | 73.000 | 76.0 | 79.000 | 82.0 |
| origin | 398.0 | 1.572864 | 0.802055 | 1.0 | 1.000 | 1.0 | 2.000 | 3.0 |
The hp column is missing from the summary above, which suggests it is not stored as a numeric dtype. Check for non-numeric entries that may need imputation.
imputeHP = pd.DataFrame(df.hp.str.isdigit())
df[imputeHP['hp'] == False]
| car_name | mpg | cyl | disp | hp | wt | acc | yr | origin | |
|---|---|---|---|---|---|---|---|---|---|
| 32 | ford pinto | 25.0 | 4 | 98.0 | ? | 2046 | 19.0 | 71 | 1 |
| 126 | ford maverick | 21.0 | 6 | 200.0 | ? | 2875 | 17.0 | 74 | 1 |
| 330 | renault lecar deluxe | 40.9 | 4 | 85.0 | ? | 1835 | 17.3 | 80 | 2 |
| 336 | ford mustang cobra | 23.6 | 4 | 140.0 | ? | 2905 | 14.3 | 80 | 1 |
| 354 | renault 18i | 34.5 | 4 | 100.0 | ? | 2320 | 15.8 | 81 | 2 |
| 374 | amc concord dl | 23.0 | 4 | 151.0 | ? | 3035 | 20.5 | 82 | 1 |
Only 6 rows have a missing hp value (marked '?'), so these 6 rows are dropped.
df.drop(df[df['hp']=='?'].index, inplace=True)
df.head()
| car_name | mpg | cyl | disp | hp | wt | acc | yr | origin | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 |
| 1 | buick skylark 320 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 |
| 2 | plymouth satellite | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 |
| 3 | amc rebel sst | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 |
| 4 | ford torino | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 |
# Convert hp from object to a numeric dtype.
df['hp'] = pd.to_numeric(df['hp'])
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| mpg | 392.0 | 23.445918 | 7.805007 | 9.0 | 17.000 | 22.75 | 29.000 | 46.6 |
| cyl | 392.0 | 5.471939 | 1.705783 | 3.0 | 4.000 | 4.00 | 8.000 | 8.0 |
| disp | 392.0 | 194.411990 | 104.644004 | 68.0 | 105.000 | 151.00 | 275.750 | 455.0 |
| hp | 392.0 | 104.469388 | 38.491160 | 46.0 | 75.000 | 93.50 | 126.000 | 230.0 |
| wt | 392.0 | 2977.584184 | 849.402560 | 1613.0 | 2225.250 | 2803.50 | 3614.750 | 5140.0 |
| acc | 392.0 | 15.541327 | 2.758864 | 8.0 | 13.775 | 15.50 | 17.025 | 24.8 |
| yr | 392.0 | 75.979592 | 3.683737 | 70.0 | 73.000 | 76.00 | 79.000 | 82.0 |
| origin | 392.0 | 1.576531 | 0.805518 | 1.0 | 1.000 | 1.00 | 2.000 | 3.0 |
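Dropping worked here because only 6 rows were affected. An alternative sketch (toy values, not the real column) coerces the '?' marker to NaN and imputes the median instead, which preserves the rows:

```python
import pandas as pd

# Toy hp-like column containing the '?' placeholder.
hp = pd.Series(["130", "165", "?", "150"])

hp_num = pd.to_numeric(hp, errors="coerce")   # '?' becomes NaN
hp_filled = hp_num.fillna(hp_num.median())    # impute with the median

assert hp_num.isna().sum() == 1
assert hp_filled.isna().sum() == 0
```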
2. Data Preparation & Analysis:
2.A Check and print feature-wise percentage of missing values present in the data and impute with the best suitable approach.
df.isna().sum()/len(df)*100
car_name    0.0
mpg         0.0
cyl         0.0
disp        0.0
hp          0.0
wt          0.0
acc         0.0
yr          0.0
origin      0.0
dtype: float64
The dataset has no missing values
Q2.B Check for duplicate values in the data and impute with the best suitable approach
df.duplicated().sum()
0
No row is duplicated.
Q2.C Plot a pairplot for all features.
sns.pairplot(df,hue='cyl',palette='tab10')
<seaborn.axisgrid.PairGrid at 0x7c9ee23d5240>
Q2.D Visualize a scatterplot for ‘wt’ and ‘disp’. Datapoints should be distinguishable by ‘cyl’.
sns.scatterplot(x=df['wt'],y=df['disp'],hue=df['cyl'],palette='tab10')
<Axes: xlabel='wt', ylabel='disp'>
Q2.E Share insights for Q2.d.
There is a positive correlation between "wt" and "disp". As the number of cylinders increases, "wt" and "disp" also increase. In this dataset, 3- and 5-cylinder vehicles are very rare.
Q2.F Visualize a scatterplot for ‘wt’ and ’mpg’. Datapoints should be distinguishable by ‘cyl’.
sns.scatterplot(x=df['wt'],y=df['mpg'],hue=df['cyl'],palette='tab10')
<Axes: xlabel='wt', ylabel='mpg'>
Q2.G Share insights for Q2.F.
There is a negative correlation between "mpg" and "wt": heavier cars give lower mileage.
Q2.H Check for unexpected values in all the features and datapoints with such values
This was partly covered in the answer to Q1.D: 6 rows had a '?' value in hp and were dropped.
Numeric_columns = ['mpg', 'cyl', 'disp', 'hp', 'wt', 'acc', 'yr', 'origin']
for i in Numeric_columns:
    Q1 = df[i].quantile(0.25)
    Q3 = df[i].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5*IQR
    upper = Q3 + 1.5*IQR
    outlier_count = ((df[i] < lower) | (df[i] > upper)).sum()
    print("The outlier count of", i, "is", outlier_count)
The outlier count of mpg is 0
The outlier count of cyl is 0
The outlier count of disp is 0
The outlier count of hp is 10
The outlier count of wt is 0
The outlier count of acc is 11
The outlier count of yr is 0
The outlier count of origin is 0
The hp and acc columns have outliers.
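If dropping the hp/acc outliers is undesirable, one common alternative (not applied here) is to clip values to the IQR fences. A small sketch with toy values:

```python
import numpy as np

# Toy acc-like values with one high outlier.
x = np.array([8.0, 13.8, 15.5, 17.0, 40.0])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Clip (winsorize) instead of dropping: outliers are pulled to the fences.
x_clipped = np.clip(x, lower, upper)
```

Clipping keeps the sample size intact, at the cost of distorting the tail of the distribution.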
df.isna().sum()
car_name    0
mpg         0
cyl         0
disp        0
hp          0
wt          0
acc         0
yr          0
origin      0
dtype: int64
Clustering
from sklearn.cluster import KMeans
from scipy.stats import zscore
# Drop the non-feature columns: car_name (text), plus yr and origin.
df_1= df.drop(['yr','origin','car_name'],axis=1)
#Scale the data
df_scaled = df_1.apply(zscore)
df_scaled.sample(5)
| mpg | cyl | disp | hp | wt | acc | |
|---|---|---|---|---|---|---|
| 25 | -1.724931 | 1.483947 | 1.584416 | 2.875254 | 1.930190 | -0.559396 |
| 139 | -1.211785 | 1.483947 | 1.029447 | 0.924265 | 1.957303 | 0.166467 |
| 168 | -0.057205 | -0.864014 | -0.520637 | -0.558487 | -0.399124 | 0.529398 |
| 395 | 1.097374 | -0.864014 | -0.568479 | -0.532474 | -0.804632 | -1.430430 |
| 69 | -1.468358 | 1.483947 | 1.488732 | 1.444529 | 1.742760 | -0.740861 |
# As per the question, apply K-Means clustering for 2 to 10 clusters (range(2, 11)).
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings('ignore')
clusters = []
for i in range(2, 11):
    kmodel = KMeans(n_clusters=i, random_state=1).fit(df_scaled)
    clusters.append(kmodel.inertia_)
    print("The K means model inertia for cluster", i, "is", kmodel.inertia_)
fig, ax = plt.subplots(figsize=(4, 4))
sns.lineplot(x=list(range(2, 11)), y=clusters, ax=ax,marker='o')
ax.set_title('Find the elbow point')
ax.set_xlabel('Clusters')
ax.set_ylabel('Inertia')
The K means model inertia for cluster 2 is 927.6954635551294
The K means model inertia for cluster 3 is 596.5247365547316
The K means model inertia for cluster 4 is 482.10788335892926
The K means model inertia for cluster 5 is 416.4942101850577
The K means model inertia for cluster 6 is 360.2140165537845
The K means model inertia for cluster 7 is 327.2083846469888
The K means model inertia for cluster 8 is 295.9354327620717
The K means model inertia for cluster 9 is 278.73123357763905
The K means model inertia for cluster 10 is 264.2567939129071
Text(0, 0.5, 'Inertia')
# For better understanding, run K-Means again for 1 to 10 clusters.
clusters = []
for i in range(1, 11):
    kmodel = KMeans(n_clusters=i, random_state=1).fit(df_scaled)
    clusters.append(kmodel.inertia_)
    print("The K means model inertia for cluster", i, "is", kmodel.inertia_)
fig, ax = plt.subplots(figsize=(4, 4))
sns.lineplot(x=list(range(1, 11)), y=clusters, ax=ax,marker='o')
ax.set_title('Find the elbow point')
ax.set_xlabel('Clusters')
ax.set_ylabel('Inertia')
arrowprops = dict(
arrowstyle = "->",
connectionstyle = "angle, angleA =-180, angleB = 45,\
rad = 10")
offset = 50
ax.annotate("possible points",
(3, 596), xytext =(offset, offset),
textcoords ='offset points',arrowprops = arrowprops)
ax.annotate("possible points",
(5, 416), xytext =( offset, offset),
textcoords ='offset points',arrowprops = arrowprops)
ax.annotate("possible points",
(6, 360), xytext =( offset, offset),
textcoords ='offset points',arrowprops = arrowprops)
The K means model inertia for cluster 1 is 2352.000000000001
The K means model inertia for cluster 2 is 927.6954635551294
The K means model inertia for cluster 3 is 596.5247365547316
The K means model inertia for cluster 4 is 482.10788335892926
The K means model inertia for cluster 5 is 416.4942101850577
The K means model inertia for cluster 6 is 360.2140165537845
The K means model inertia for cluster 7 is 327.2083846469888
The K means model inertia for cluster 8 is 295.9354327620717
The K means model inertia for cluster 9 is 278.73123357763905
The K means model inertia for cluster 10 is 264.2567939129071
Text(50, 50, 'possible points')
from sklearn.metrics import silhouette_samples, silhouette_score
df_scaled.sample(5)
| mpg | cyl | disp | hp | wt | acc | |
|---|---|---|---|---|---|---|
| 371 | 0.712514 | -0.864014 | -0.568479 | -0.532474 | -0.533507 | 0.166467 |
| 138 | -1.211785 | 1.483947 | 1.182542 | 1.184397 | 1.743939 | -0.740861 |
| 312 | 1.764465 | -0.864014 | -1.037332 | -1.026725 | -1.129982 | 0.311639 |
| 210 | -0.570352 | 0.309967 | -0.367542 | 0.091842 | -0.056092 | -0.014999 |
| 215 | -1.340071 | 1.483947 | 1.182542 | 1.184397 | 0.916420 | -0.559396 |
The possible elbow points are 3, 5, and 6.
Q3.D Train a K-means clustering model once again on the optimal number of clusters.
Q3.E Add a new feature in the DataFrame which will have labels based upon cluster value.
Q3.F Plot a visual and color the datapoints based upon clusters.
p = [2, 3, 4, 5, 6, 7, 8]
for f in p:
    km = KMeans(n_clusters=f, random_state=1).fit(df_scaled)
    labels = km.labels_
    print("Silhouette_score for elbow point", f, "is", silhouette_score(df_scaled, labels))
Silhouette_score for elbow point 2 is 0.5450184683536872
Silhouette_score for elbow point 3 is 0.44234710113179243
Silhouette_score for elbow point 4 is 0.3816218513467549
Silhouette_score for elbow point 5 is 0.36979033611463874
Silhouette_score for elbow point 6 is 0.33150999377045476
Silhouette_score for elbow point 7 is 0.3041083061385893
Silhouette_score for elbow point 8 is 0.2957075837064921
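The silhouette scores above decline steadily; on data with genuinely well-separated groups the score peaks at the true cluster count. A self-contained sketch on synthetic blobs (assumed toy data, not the car dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated 2-D blobs: silhouette should peak at k = 3.
rng = np.random.default_rng(0)
blobs = np.vstack([rng.normal(loc=c, scale=0.2, size=(50, 2)) for c in (0.0, 5.0, 10.0)])

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(blobs)
    scores[k] = silhouette_score(blobs, labels)

best_k = max(scores, key=scores.get)
```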
import matplotlib.pyplot as plt
import seaborn as sns
width = 8
height = 4
s_score=[]
sns.set(rc = {'figure.figsize':(width,height)})
elbow_points = [3, 5, 6]
feature_cols = ['mpg', 'cyl', 'disp', 'hp', 'wt', 'acc']
for i in elbow_points:
    # Fit on the original feature columns only, so the label columns added in
    # earlier iterations do not leak into later fits.
    km = KMeans(n_clusters=i, random_state=10).fit(df_scaled[feature_cols])
    z = 'Labels_' + str(i)
    df_scaled[z] = km.labels_
    fig, axes = plt.subplots(1, 2)
    fig.subplots_adjust(hspace=0.125, wspace=.5)
    sns.scatterplot(x=df_scaled['wt'], y=df_scaled['mpg'], hue=df_scaled[z], ax=axes[0])
    sns.scatterplot(x=df_scaled['wt'], y=df_scaled['hp'], hue=df_scaled[z], ax=axes[1])
df_scaled.head()
| mpg | cyl | disp | hp | wt | acc | Labels_3 | Labels_5 | Labels_6 | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.698638 | 1.483947 | 1.077290 | 0.664133 | 0.620540 | -1.285258 | 1 | 1 | 5 |
| 1 | -1.083498 | 1.483947 | 1.488732 | 1.574594 | 0.843334 | -1.466724 | 1 | 1 | 5 |
| 2 | -0.698638 | 1.483947 | 1.182542 | 1.184397 | 0.540382 | -1.648189 | 1 | 1 | 5 |
| 3 | -0.955212 | 1.483947 | 1.048584 | 1.184397 | 0.536845 | -1.285258 | 1 | 1 | 5 |
| 4 | -0.826925 | 1.483947 | 1.029447 | 0.924265 | 0.555706 | -1.829655 | 1 | 1 | 5 |
df_1.tail(5)
| mpg | cyl | disp | hp | wt | acc | |
|---|---|---|---|---|---|---|
| 393 | 27.0 | 4 | 140.0 | 86 | 2790 | 15.6 |
| 394 | 44.0 | 4 | 97.0 | 52 | 2130 | 24.6 |
| 395 | 32.0 | 4 | 135.0 | 84 | 2295 | 11.6 |
| 396 | 28.0 | 4 | 120.0 | 79 | 2625 | 18.6 |
| 397 | 31.0 | 4 | 119.0 | 82 | 2720 | 19.4 |
df_scaled.tail()
| mpg | cyl | disp | hp | wt | acc | Labels_3 | Labels_5 | Labels_6 | |
|---|---|---|---|---|---|---|---|---|---|
| 393 | 0.455941 | -0.864014 | -0.520637 | -0.480448 | -0.221125 | 0.021294 | 2 | 4 | 2 |
| 394 | 2.636813 | -0.864014 | -0.932079 | -1.364896 | -0.999134 | 3.287676 | 2 | 3 | 4 |
| 395 | 1.097374 | -0.864014 | -0.568479 | -0.532474 | -0.804632 | -1.430430 | 2 | 0 | 3 |
| 396 | 0.584228 | -0.864014 | -0.712005 | -0.662540 | -0.415627 | 1.110088 | 2 | 3 | 4 |
| 397 | 0.969088 | -0.864014 | -0.721574 | -0.584501 | -0.303641 | 1.400433 | 2 | 3 | 4 |
3G. Pass a new DataPoint and predict which cluster it belongs to.
km_3 = KMeans(n_clusters=3, random_state=10).fit(df_1)
cluster=km_3.predict([[28,4,120,79,2625,18.6]])[0]
cluster
2
km_6 = KMeans(n_clusters=6, random_state=5).fit(df_1)
cluster=km_6.predict([[28,4,120,79,2625,18.6]])[0]
cluster
4
km_5 = KMeans(n_clusters=5).fit(df_1)
cluster=km_5.predict([[30,6,123,120,3000,18.6]])[0]
cluster
0
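One caveat worth noting: the elbow/silhouette analysis above was run on the z-scored data, while these predictions refit K-Means on the unscaled df_1. A sketch (toy 2-feature data, an assumption of this example) of keeping one fitted scaler so a raw new point is predicted in the same space the clusters were learned in:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Toy mpg-like / wt-like data: two light high-mpg cars, two heavy low-mpg cars.
X = np.array([[28.0, 2600.0], [30.0, 2700.0], [14.0, 4300.0], [15.0, 4400.0]])

scaler = StandardScaler().fit(X)                   # fit the scaler once
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(scaler.transform(X))

# A raw new point goes through the SAME scaler before predict().
new_point = np.array([[29.0, 2650.0]])
label = km.predict(scaler.transform(new_point))[0]
```

Without this, a new point expressed in raw units would be scored against centroids learned on standardized units, which silently gives a meaningless cluster assignment.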
PART- B
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from scipy.stats import zscore
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn import preprocessing
from sklearn.metrics import roc_auc_score
from sklearn import metrics
from sklearn.metrics import classification_report
1. Data Understanding & Cleaning:
A. Read ‘vehicle.csv’ and save as DataFrame.
df_vehicle = pd.read_csv('/content/drive/MyDrive/vehicle.csv')
df_vehicle.head()
| compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 95 | 48.0 | 83.0 | 178.0 | 72.0 | 10 | 162.0 | 42.0 | 20.0 | 159 | 176.0 | 379.0 | 184.0 | 70.0 | 6.0 | 16.0 | 187.0 | 197 | van |
| 1 | 91 | 41.0 | 84.0 | 141.0 | 57.0 | 9 | 149.0 | 45.0 | 19.0 | 143 | 170.0 | 330.0 | 158.0 | 72.0 | 9.0 | 14.0 | 189.0 | 199 | van |
| 2 | 104 | 50.0 | 106.0 | 209.0 | 66.0 | 10 | 207.0 | 32.0 | 23.0 | 158 | 223.0 | 635.0 | 220.0 | 73.0 | 14.0 | 9.0 | 188.0 | 196 | car |
| 3 | 93 | 41.0 | 82.0 | 159.0 | 63.0 | 9 | 144.0 | 46.0 | 19.0 | 143 | 160.0 | 309.0 | 127.0 | 63.0 | 6.0 | 10.0 | 199.0 | 207 | van |
| 4 | 85 | 44.0 | 70.0 | 205.0 | 103.0 | 52 | 149.0 | 45.0 | 19.0 | 144 | 241.0 | 325.0 | 188.0 | 127.0 | 9.0 | 11.0 | 180.0 | 183 | bus |
df_vehicle.shape
(846, 19)
1.B Check percentage of missing values and impute with correct approach.
df_vehicle.isna().sum()/len(df_vehicle)*100
compactness                    0.000000
circularity                    0.591017
distance_circularity           0.472813
radius_ratio                   0.709220
pr.axis_aspect_ratio           0.236407
max.length_aspect_ratio        0.000000
scatter_ratio                  0.118203
elongatedness                  0.118203
pr.axis_rectangularity         0.354610
max.length_rectangularity      0.000000
scaled_variance                0.354610
scaled_variance.1              0.236407
scaled_radius_of_gyration      0.236407
scaled_radius_of_gyration.1    0.472813
skewness_about                 0.709220
skewness_about.1               0.118203
skewness_about.2               0.118203
hollows_ratio                  0.000000
class                          0.000000
dtype: float64
df_vehicle.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   compactness                  846 non-null    int64
 1   circularity                  841 non-null    float64
 2   distance_circularity         842 non-null    float64
 3   radius_ratio                 840 non-null    float64
 4   pr.axis_aspect_ratio         844 non-null    float64
 5   max.length_aspect_ratio      846 non-null    int64
 6   scatter_ratio                845 non-null    float64
 7   elongatedness                845 non-null    float64
 8   pr.axis_rectangularity       843 non-null    float64
 9   max.length_rectangularity    846 non-null    int64
 10  scaled_variance              843 non-null    float64
 11  scaled_variance.1            844 non-null    float64
 12  scaled_radius_of_gyration    844 non-null    float64
 13  scaled_radius_of_gyration.1  842 non-null    float64
 14  skewness_about               840 non-null    float64
 15  skewness_about.1             845 non-null    float64
 16  skewness_about.2             845 non-null    float64
 17  hollows_ratio                846 non-null    int64
 18  class                        846 non-null    object
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB
df_vehicle = df_vehicle.replace(' ', np.nan)
for i in df_vehicle.columns[:18]:
    var = df_vehicle[i].median()
    df_vehicle[i] = df_vehicle[i].fillna(var)
df_vehicle.isna().sum()/len(df_vehicle)*100
compactness                    0.0
circularity                    0.0
distance_circularity           0.0
radius_ratio                   0.0
pr.axis_aspect_ratio           0.0
max.length_aspect_ratio        0.0
scatter_ratio                  0.0
elongatedness                  0.0
pr.axis_rectangularity         0.0
max.length_rectangularity      0.0
scaled_variance                0.0
scaled_variance.1              0.0
scaled_radius_of_gyration      0.0
scaled_radius_of_gyration.1    0.0
skewness_about                 0.0
skewness_about.1               0.0
skewness_about.2               0.0
hollows_ratio                  0.0
class                          0.0
dtype: float64
All the missing values are handled.
1.C Visualize a Pie-chart and print percentage of values for variable ‘class’.
df_vehicle['class'].value_counts()
car    429
bus    218
van    199
Name: class, dtype: int64
df_vehicle['class'].value_counts().plot(kind='pie',autopct='%1.1f%%')
<Axes: ylabel='class'>
1.D Check for duplicate rows in the data and impute with correct approach
df_vehicle.duplicated().sum()
0
No duplicate rows are present, so no treatment is needed.
2. Data Preparation:
A. Split data into X and Y. [Train and Test optional]
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder().fit(df_vehicle['class'])
df_vehicle['class'] = le.transform(df_vehicle['class'])
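LabelEncoder assigns integer codes in alphabetical order over the observed classes, so here bus→0, car→1, van→2. A small sketch of the mapping (toy labels mirroring this dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Codes are assigned alphabetically over the unique labels seen at fit time.
le = LabelEncoder().fit(["van", "car", "bus", "car"])

assert list(le.classes_) == ["bus", "car", "van"]
assert list(le.transform(["bus", "car", "van"])) == [0, 1, 2]
assert list(le.inverse_transform([2])) == ["van"]
```

This explains the integer codes visible in the class column of the table below (van = 2, car = 1, bus = 0).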
df_vehicle.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   compactness                  846 non-null    int64
 1   circularity                  846 non-null    float64
 2   distance_circularity         846 non-null    float64
 3   radius_ratio                 846 non-null    float64
 4   pr.axis_aspect_ratio         846 non-null    float64
 5   max.length_aspect_ratio      846 non-null    int64
 6   scatter_ratio                846 non-null    float64
 7   elongatedness                846 non-null    float64
 8   pr.axis_rectangularity       846 non-null    float64
 9   max.length_rectangularity    846 non-null    int64
 10  scaled_variance              846 non-null    float64
 11  scaled_variance.1            846 non-null    float64
 12  scaled_radius_of_gyration    846 non-null    float64
 13  scaled_radius_of_gyration.1  846 non-null    float64
 14  skewness_about               846 non-null    float64
 15  skewness_about.1             846 non-null    float64
 16  skewness_about.2             846 non-null    float64
 17  hollows_ratio                846 non-null    int64
 18  class                        846 non-null    int64
dtypes: float64(14), int64(5)
memory usage: 125.7 KB
df_vehicle.head()
| compactness | circularity | distance_circularity | radius_ratio | pr.axis_aspect_ratio | max.length_aspect_ratio | scatter_ratio | elongatedness | pr.axis_rectangularity | max.length_rectangularity | scaled_variance | scaled_variance.1 | scaled_radius_of_gyration | scaled_radius_of_gyration.1 | skewness_about | skewness_about.1 | skewness_about.2 | hollows_ratio | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 95 | 48.0 | 83.0 | 178.0 | 72.0 | 10 | 162.0 | 42.0 | 20.0 | 159 | 176.0 | 379.0 | 184.0 | 70.0 | 6.0 | 16.0 | 187.0 | 197 | 2 |
| 1 | 91 | 41.0 | 84.0 | 141.0 | 57.0 | 9 | 149.0 | 45.0 | 19.0 | 143 | 170.0 | 330.0 | 158.0 | 72.0 | 9.0 | 14.0 | 189.0 | 199 | 2 |
| 2 | 104 | 50.0 | 106.0 | 209.0 | 66.0 | 10 | 207.0 | 32.0 | 23.0 | 158 | 223.0 | 635.0 | 220.0 | 73.0 | 14.0 | 9.0 | 188.0 | 196 | 1 |
| 3 | 93 | 41.0 | 82.0 | 159.0 | 63.0 | 9 | 144.0 | 46.0 | 19.0 | 143 | 160.0 | 309.0 | 127.0 | 63.0 | 6.0 | 10.0 | 199.0 | 207 | 2 |
| 4 | 85 | 44.0 | 70.0 | 205.0 | 103.0 | 52 | 149.0 | 45.0 | 19.0 | 144 | 241.0 | 325.0 | 188.0 | 127.0 | 9.0 | 11.0 | 180.0 | 183 | 0 |
# Features (X)
X = df_vehicle.drop('class', axis=1)
# Target (y)
y = df_vehicle['class']
B. Standardize the Data.
# Scaling the independent attributes using zscore
X_scaled=X.apply(zscore)
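A subtle point about this scaling step: `scipy.stats.zscore` uses the population standard deviation (ddof=0) by default, while pandas' `.std()` defaults to ddof=1. A sketch of the equivalence on toy data:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

X = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 40.0, 80.0]})

Xz = X.apply(zscore)                      # zscore defaults to ddof=0
manual = (X - X.mean()) / X.std(ddof=0)   # must pass ddof=0 to match

assert np.allclose(Xz, manual)
assert np.allclose(Xz.mean(), 0.0)        # each column centered at 0
```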
3. Model Building:
# Note: the split uses the unscaled X, so the base SVM below is trained on
# unstandardized features (X_scaled is used later for PCA).
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=10)
3.A Train a base Classification model using SVM.
3.B Print Classification metrics for train data.
def performance_analysis(a, b):
    # Append order must match the row labels used later:
    # Accuracy, Recall, Precision, F1-score, roc_auc_score
    q = []
    q.append(accuracy_score(a, b))
    q.append(recall_score(a, b, average="macro"))
    q.append(precision_score(a, b, average="macro"))
    q.append(f1_score(a, b, average="macro"))
    q.append(multiclass_roc_auc_score(a, b, average="macro"))
    return q
def test_train_analysis(ytrain, ytest, predict_train, predict_test):
    train = performance_analysis(ytrain, predict_train)
    test = performance_analysis(ytest, predict_test)
    data = {'train': train, 'test': test}
    Name = ['Accuracy', 'Recall', 'Precision', 'F1-score', "roc_auc_score"]
    df2 = pd.DataFrame(data, index=Name)
    df2.reset_index(inplace=True)
    display(df2)
def conf_metrix(y, pred):
    cm = metrics.confusion_matrix(y, pred, labels=[0, 1, 2])
    # LabelEncoder assigns codes alphabetically: 0 = bus, 1 = car, 2 = van
    df_cm = pd.DataFrame(cm, index=["Bus", "Car", "Van"], columns=["Bus", "Car", "Van"])
    plt.figure(figsize=(7, 5))
    sns.heatmap(df_cm, annot=True, fmt='d')
    plt.show()
def multiclass_roc_auc_score(y_test, y_pred, average="macro"):
    lb = preprocessing.LabelBinarizer()
    lb.fit(y_test)
    y_test = lb.transform(y_test)
    y_pred = lb.transform(y_pred)
    return roc_auc_score(y_test, y_pred, average=average)
def complete_analysis(X_train, X_test, y_train, y_test, ML):
    pred_train = ML.predict(X_train)
    pred_test = ML.predict(X_test)
    test_train_analysis(y_train, y_test, pred_train, pred_test)
    conf_metrix(y_test, pred_test)
    print("Classification report on training data=================================")
    print(classification_report(y_train, pred_train))
    print("Classification report on test data=================================")
    print(classification_report(y_test, pred_test))
def data_analysis(y, predicted_x):
    result = performance_analysis(y, predicted_x)
    data = {'performance': result}
    Name = ['Accuracy', 'Recall', 'Precision', 'F1-score', "roc_auc_score"]
    df2 = pd.DataFrame(data, index=Name)
    df2.reset_index(inplace=True)
    display(df2)
svc = SVC(random_state=10)
svc=svc.fit(xtrain, ytrain)
complete_analysis(xtrain,xtest,ytrain,ytest,svc)
| index | train | test | |
|---|---|---|---|
| 0 | Accuracy | 0.685811 | 0.649606 |
| 1 | Recall | 0.655186 | 0.619053 |
| 2 | Precision | 0.656347 | 0.617826 |
| 3 | F1-score | 0.649558 | 0.614685 |
| 4 | roc_auc_score | 0.746460 | 0.716827 |
Classification report on training data=================================
precision recall f1-score support
0 0.62 0.48 0.54 147
1 0.79 0.77 0.78 304
2 0.55 0.72 0.63 141
accuracy 0.69 592
macro avg 0.66 0.66 0.65 592
weighted avg 0.69 0.69 0.68 592
Classification report on test data=================================
precision recall f1-score support
0 0.59 0.46 0.52 71
1 0.74 0.77 0.75 125
2 0.53 0.62 0.57 58
accuracy 0.65 254
macro avg 0.62 0.62 0.61 254
weighted avg 0.65 0.65 0.65 254
3C. Apply PCA on the data with 10 components.
3D. Visualize Cumulative Variance Explained with Number of Components.
3E. Draw a horizontal line on the above plot to highlight the threshold of 90%.
from sklearn.decomposition import PCA
pca=PCA(n_components=10,random_state=10)
pca_model=pca.fit_transform(X_scaled)
# Calculate each retained component's share of the retained variance
var_explained_per = pca.explained_variance_ / np.sum(pca.explained_variance_)
print("Variance_explained =", var_explained_per)
# Cumulative sum
cum_var_explained = np.cumsum(var_explained_per)
print("Cumulative_variance_explained =", cum_var_explained)
plt.plot(cum_var_explained, marker='*', markerfacecolor='black', markersize=8)
plt.axhline(y=.9)
plt.xlabel('n_components')
plt.ylabel('Cumulative_variance_explained')
plt.show()
Variance_explained = [0.52854121 0.16943943 0.10697862 0.0663128 0.0515503 0.03034773 0.0201686 0.01247266 0.00902625 0.0051624 ]
Cumulative_variance_explained = [0.52854121 0.69798064 0.80495926 0.87127206 0.92282236 0.95317009 0.97333869 0.98581135 0.9948376 1.]
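Two hedged notes on this step: scikit-learn already exposes `explained_variance_ratio_`, and it divides by the *total* variance of the data rather than by the sum over the retained components, so the manual ratio above (computed over 10 of the 18 components) slightly overstates each component's share of total variance. A toy-data sketch of when the two agree and when they diverge:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Keeping every component: manual normalization equals explained_variance_ratio_.
pca_full = PCA(n_components=5).fit(X)
manual_full = pca_full.explained_variance_ / pca_full.explained_variance_.sum()

# Keeping a subset: the manual ratio renormalizes over the subset and its
# cumulative sum reaches 1.0 even though variance was discarded.
pca_part = PCA(n_components=3).fit(X)
manual_part = pca_part.explained_variance_ / pca_part.explained_variance_.sum()
```

This is why the cumulative curve above ends exactly at 1.0 despite only 10 of 18 components being kept.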
3F. Apply PCA on the data. This time Select Minimum Components with 90% or above variance explained.
Reading the graph carefully: the x-axis is zero-based, so the point at x = 3 corresponds to 4 components (cumulative 87.1%) and x = 4 to 5 components (92.3%). Strictly, 5 components are needed to reach the 90% threshold; the analysis below nevertheless proceeds with n_components=4.
pca=PCA(n_components=4, random_state=10)
pca_model_4=pca.fit_transform(X_scaled)
3.G Train SVM model on components selected from above step
3.H Print Classification metrics for train data of above model and share insights
svc_pca3 = SVC()
svc_pca3=svc_pca3.fit(pca_model_4,y)
svc_pca_model = svc_pca3.predict(pca_model_4)
print(classification_report(y,svc_pca_model))
data_analysis(y,svc_pca_model)
conf_metrix(y,svc_pca_model)
precision recall f1-score support
0 0.85 0.66 0.74 218
1 0.83 0.90 0.86 429
2 0.67 0.73 0.70 199
accuracy 0.80 846
macro avg 0.79 0.76 0.77 846
weighted avg 0.80 0.80 0.79 846
| index | performance | |
|---|---|---|
| 0 | Accuracy | 0.796690 |
| 1 | Recall | 0.786608 |
| 2 | Precision | 0.762210 |
| 3 | F1-score | 0.769622 |
| 4 | roc_auc_score | 0.825663 |
4. Performance Improvement:
A. Train another SVM on the components out of PCA. Tune the parameters to improve performance.
B. Share best parameters observed from above step.
C. Print classification metrics for train data of above model and share relative improvement in performance in all the models along with insights.
Grid Search SVM on PCA model
svc_grid = SVC(random_state=10)
param_grid = {'C': [0.1, 1, 10],
'gamma': [1, 0.1, 0.01],
'kernel': ['sigmoid','rbf','linear','poly']}
grid_svc = GridSearchCV(svc_grid, param_grid)
grid_svc_pca=grid_svc.fit(pca_model_4, y)
print("Params",grid_svc_pca.get_params)
print("Best Params",grid_svc_pca.best_params_)
svc_pca=grid_svc_pca.predict(pca_model_4)
print(classification_report(y,svc_pca))
data_analysis(y,svc_pca)
conf_metrix(y,svc_pca)
Params <bound method BaseEstimator.get_params of GridSearchCV(estimator=SVC(random_state=10),
param_grid={'C': [0.1, 1, 10], 'gamma': [1, 0.1, 0.01],
'kernel': ['sigmoid', 'rbf', 'linear', 'poly']})>
Best Params {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
precision recall f1-score support
0 0.88 0.81 0.84 218
1 0.88 0.92 0.90 429
2 0.78 0.77 0.77 199
accuracy 0.86 846
macro avg 0.85 0.83 0.84 846
weighted avg 0.86 0.86 0.86 846
| index | performance | |
|---|---|---|
| 0 | Accuracy | 0.856974 |
| 1 | Recall | 0.845519 |
| 2 | Precision | 0.833839 |
| 3 | F1-score | 0.839137 |
| 4 | roc_auc_score | 0.878167 |
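One caveat about the numbers above: the grid-searched model is scored on the same rows it was tuned on, which is optimistic. A hedged sketch (synthetic data and an assumed small grid, not the vehicle dataset) of nested cross-validation, which scores the tuned model only on held-out folds:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the PCA-reduced features (3 classes, 4 features).
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

inner = GridSearchCV(SVC(random_state=0), {"C": [0.1, 1, 10]}, cv=3)
scores = cross_val_score(inner, X, y, cv=3)   # outer folds never seen during tuning

print("Nested-CV accuracy: %.3f" % scores.mean())
```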
def summary_table(models, xtrain, xtest, ytrain, ytest):
    df_S = pd.DataFrame(index=['Accuracy', 'Recall', 'Precision', 'F1-score', "roc_auc_score"])
    for i in models:
        x = performance_analysis(ytrain, i.predict(xtrain))
        df_S[str(i)[0:3] + "_Train"] = x
        y = performance_analysis(ytest, i.predict(xtest))
        df_S[str(i)[0:3] + "_Test"] = y
    return df_S
def summary_table2(models, x, y):
    df_S = pd.DataFrame(index=['Accuracy', 'Recall', 'Precision', 'F1-score', "roc_auc_score"])
    for m in models:
        scores = performance_analysis(y, m.predict(x))  # keep y intact for subsequent models
        df_S[str(m)[0:9] + "result"] = scores
    return df_S
models=[svc]
tab1=summary_table(models,xtrain,xtest,ytrain,ytest)
models=[svc_pca3]
tab2=summary_table2(models,pca_model_4,y)
models=[grid_svc_pca]
tab3=summary_table2(models,pca_model_4,y)
tab4=tab2.join(tab3)
tab4.rename(columns={'SVC()result':'SVC_PCA','GridSearcresult':'SVC_PCA_Grid'}, inplace = True)
display(tab1.join(tab4))
| SVC_Train | SVC_Test | SVC_PCA | SVC_PCA_Grid | |
|---|---|---|---|---|
| Accuracy | 0.685811 | 0.649606 | 0.796690 | 0.856974 |
| Recall | 0.655186 | 0.619053 | 0.786608 | 0.845519 |
| Precision | 0.656347 | 0.617826 | 0.762210 | 0.833839 |
| F1-score | 0.649558 | 0.614685 | 0.769622 | 0.839137 |
| roc_auc_score | 0.746460 | 0.716827 | 0.825663 | 0.878167 |
Performance improves markedly for the SVC-with-PCA model, and further with the GridSearch-tuned SVC on PCA. Note, however, that both PCA models are scored on the same data they were fit on, while SVC_Test uses held-out data, so the comparison flatters the PCA models.
5. Data Understanding & Cleaning
sns.pairplot(pd.DataFrame(X))
<seaborn.axisgrid.PairGrid at 0x7c9ed17bb130>
sns.pairplot(pd.DataFrame(pca_model_4))
<seaborn.axisgrid.PairGrid at 0x7c9ed17b91e0>
A. Explain pre-requisites/assumptions of PCA.
- PCA is a linear technique: it assumes the important structure in the data can be captured by linear combinations of the features.
- Features should be numeric and standardized, because PCA is sensitive to scale.
- It assumes high-variance directions are the informative ones, and is most useful when features are correlated.
B. Explain advantages and limitations of PCA.
Advantages - reduces dimensionality and training time; removes multicollinearity, since the components are orthogonal; can filter out noise; helps visualize high-dimensional data.
Limitations - components are linear combinations of all original features and therefore hard to interpret; some information is always lost when components are dropped; sensitive to scaling and outliers; cannot capture non-linear structure.
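One standard advantage of PCA, that it removes multicollinearity, can be verified directly: the resulting components are uncorrelated. A minimal sketch on toy correlated data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Two strongly correlated toy features.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.1 * rng.normal(size=200)])

# PCA scores of correlated inputs are uncorrelated (orthogonal components).
Z = PCA(n_components=2).fit_transform(X)
cov = np.cov(Z.T)

assert abs(cov[0, 1]) < 1e-8   # off-diagonal covariance vanishes
```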